forked from k8snetworkplumbingwg/sriov-network-operator
-
Notifications
You must be signed in to change notification settings - Fork 0
Sync with upstream main #6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
clarkzinzow
merged 103 commits into
togethercomputer:master
from
k8snetworkplumbingwg:master
Jan 31, 2025
Merged
Sync with upstream main #6
clarkzinzow
merged 103 commits into
togethercomputer:master
from
k8snetworkplumbingwg:master
Jan 31, 2025
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
PrometheusRules allow recording pre-defined queries. Use `sriov_kubepoddevice` metric to add `pod|namespace` pair to the sriov metrics. Feature is enabled via the `METRICS_EXPORTER_PROMETHEUS_DEPLOY_RULE` environment variable. Signed-off-by: Andrea Panattoni <apanatto@redhat.com>
When the `metricsExporter` feature is turned off, deployed resources
should be removed. These changes fix the error:
```
│ 2024-08-28T14:07:57.699760017Z ERROR controller/controller.go:266 Reconciler error {"controller": "sriovoperatorconfig", "controllerGroup": "sriovnetwork.openshift.io", "controllerKind": "SriovOperatorConfig", "SriovOperatorConfig": {"name":"default","namespace":"openshift-sriov-network-operator"}, │
│ "namespace": "openshift-sriov-network-operator", "name": "default", "reconcileID": "fa841c50-dbb8-4c4c-9ddd-b98624fd2a24", "error": "failed to delete object &{map[apiVersion:monitoring.coreos.com/v1 kind:ServiceMonitor metadata:map[name:sriov-network-metrics-exporter namespace:openshift-sriov-network-operator] │
│ spec:map[endpoints:[map[bearerTokenFile:/var/run/secrets/kubernetes.io/serviceaccount/token honorLabels:true interval:30s port:sriov-network-metrics scheme:https tlsConfig:map[caFile:/etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt insecureSkipVerify:false serverName:sriov-network-metrics-expor │
│ ter-service.openshift-sriov-network-operator.svc]]] namespaceSelector:map[matchNames:[openshift-sriov-network-operator]] selector:map[matchLabels:map[name:sriov-network-metrics-exporter-service]]]]} with err: could not delete object (monitoring.coreos.com/v1, Kind=ServiceMonitor) openshift-sriov-network-operato │
│ r/sriov-network-metrics-exporter: servicemonitors.monitoring.coreos.com \"sriov-network-metrics-exporter\" is forbidden: User \"system:serviceaccount:openshift-sriov-network-operator:sriov-network-operator\" cannot delete resource \"servicemonitors\" in API group \"monitoring.coreos.com\" in the namespace \"ope │
│ nshift-sriov-network-operator\""}
```
Signed-off-by: Andrea Panattoni <apanatto@redhat.com>
[metrics 4/x] Metrics exporter rules
Refactor some conformance tests to use `SRIOV_NODE_AND_DEVICE_NAME_FILTER`
if the current obj as annotation and the updated doesn't we still want to add the ones from the current object Signed-off-by: Sebastian Sch <sebassch@gmail.com>
When a user deletes the default SriovOperatorConfig resource and tries to recreate it afterwards, the operator webhooks returns the error: ``` Error from server (InternalError): error when creating "/tmp/opconfig.yml": Internal error occurred: failed calling webhook "operator-webhook.sriovnetwork.openshift.io": failed to call webhook: Post "https://operator-webhook-service.openshift-sriov-network-operator.svc:443/validating-custom-resource?timeout=10s": service "operator-webhook-service" not found ``` as the webhook configuration is still present, while the Service and the DaemonSet has been deleted. Delete all the webhook configurations when the user deletes the default SriovOperatorConfig Signed-off-by: Andrea Panattoni <apanatto@redhat.com>
Fix merge annotation function
Signed-off-by: Andrea Panattoni <apanatto@redhat.com>
The bash syntax was incorrect and yielded:
hack/env.sh: line 35: ${$RDMA_CNI_IMAGE:-}: bad substitution
Fix syntax for RDMA_CNI_IMAGE var substitution
It might happen that two SR-IOV pods, deployed on different node, are using devices
with the same PCI address. In such cases, the query suggested [1] by the sriov-network-metrics-exporter produces the error:
```
Error loading values found duplicate series for the match group {pciAddr="0000:3b:02.4"} on the right hand-side of the operation:
[
{
__name__="sriov_kubepoddevice",
container="test",
dev_type="openshift.io/intelnetdevice",
endpoint="sriov-network-metrics",
instance="10.1.98.60:9110",
job="sriov-network-metrics-exporter-service",
namespace="cnf-4916",
pciAddr="0000:3b:02.4",
pod="pod-cnfdr22.telco5g.eng.rdu2.redhat.com",
prometheus="openshift-monitoring/k8s",
service="sriov-network-metrics-exporter-service"
}, {
__name__="sriov_kubepoddevice",
container="test",
dev_type="openshift.io/intelnetdevice",
endpoint="sriov-network-metrics",
instance="10.1.98.230:9110",
job="sriov-network-metrics-exporter-service",
namespace="cnf-4916",
pciAddr="0000:3b:02.4",
pod="pod-dhcp-98-230.telco5g.eng.rdu2.redhat.com",
prometheus="openshift-monitoring/k8s",
service="sriov-network-metrics-exporter-service"
}
];many-to-many matching not allowed: matching labels must be unique on one side
```
Configure the ServiceMonitor resource to add a `node` label to all metrics.
The right query to get metrics, as updated in the PrometheusRule, will be:
```
sriov_vf_tx_packets * on (pciAddr,node) group_left(pod,namespace,dev_type) sriov_kubepoddevice
```
Also remove `pod`, `namespace` and `container` label from the `sriov_vf_*` metrics, as they were
wrongly set to `sriov-network-metrics-exporter-zj2n9`, `openshift-sriov-network-operator`, `kube-rbac-proxy`
[1] https://github.com/k8snetworkplumbingwg/sriov-network-metrics-exporter/blob/0f6a784f377ede87b95f31e569116ceb9775b5b9/README.md?plain=1#L38
Signed-off-by: Andrea Panattoni <apanatto@redhat.com>
When we want to use config-drive in immutable systems, very often the config-drive is only used at boot and then umounted (e.g. ignition does this). Later when we want to fetch Metadata from the config drive, we actually have to mount it. In this PR, I'm adding similar code than coreos/ignition where we dynamically mount the config-drive is the device was found with the right label (config-2 or CONFIG-2 as documented in OpenStack). If the device is found, we mount it, fetch the data and umount it.
Fixes the following shellcheck error:
SC2068 (error): Double quote array expansions to avoid re-splitting elements.
https://www.shellcheck.net/wiki/SC2068
Fixes the following shellcheck error:
SC2148 (error): Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.
https://www.shellcheck.net/wiki/SC2148
Fixes the following shellcheck errors:
SC2145 (error): Argument mixes string and array. Use * or separate argument.
SC2199 (error): Arrays implicitly concatenate in [[ ]]. Use a loop (or explicit * instead of @).
https://www.shellcheck.net/wiki/SC2145
https://www.shellcheck.net/wiki/SC2199
Also fixes a typo in SUPPORTED_INTERFACE_SWITCHER_MODES.
Fixes the following shellcheck error:
SC2045 (error): Iterating over ls output is fragile. Use globs.
https://www.shellcheck.net/wiki/SC2045
On some kernels GetDevlinkDeviceParam may return empty values for some kernel parameters. The netlink library is able to handle this, but the code in GetDevlinkDeviceParam function may panic if unexpected value received. Add extra checks to avoid panics
Delete webhooks when SriovOperatorConfig is deleted
Fix: GetDevlinkDeviceParam to handle edge-cases correctly
[metrics 5/x] Add node label to sriov_* metrics
`sriov_kubepoddevice` metric might end up in the Prometheus database after a while, as the default scrape interval is 30s. This leads to failures in the end-to-end lane like: ``` [sriov] Metrics Exporter When Prometheus operator is available [It] Metrics should have the correct labels /root/opr-ocp2-1/data/sriov-network-operator/sriov-network-operator/test/conformance/tests/test_exporter_metrics.go:132 [FAILED] no value for metric sriov_kubepoddevice ``` Put the metric assertion in an `Eventually` statement Signed-off-by: Andrea Panattoni <apanatto@redhat.com>
Signed-off-by: Sebastian Sch <sebassch@gmail.com>
Fix NRI rbac
Fixes the following shellcheck error:
SC2081 (error): [ .. ] can't match globs. Use a case statement.
https://www.shellcheck.net/wiki/SC2081
Warns about shellcheck issues with severity `error`.
metrics: Fix `Metrics should have the correct labels` test
CI: Add a bash linter to pre-submits
openstack: dynamically mount the config-drive
When the operator changes the device-plugin Spec (e.g. .Spec.NodeSelector), it may happen that there are two device plugin pods for a given node, one that is terminating, the other that is initializing. If the config-daemon executes `restartDevicePluginPod()` at the same time, it may kill the terminating pod, while the initializing one will run with the old dp configuration. This may cause one or more resources to not being advertised, until a manual device plugin restart occurs. Make the config-daemon restart all the device-plugin instances it founds for its own node. Signed-off-by: Andrea Panattoni <apanatto@redhat.com>
Signed-off-by: Ivan Kolodiazhnyi <ikolodiazhny@nvidia.com>
this is needed because after a reboot on a single node the operator webhook may not be ready Signed-off-by: Sebastian Sch <sebassch@gmail.com>
functest: add retry for rdma functionel test
…SET block When "$SKIP_VAR_SET" is unset and the environment variables fallback to the default, the check for valid values should be done. Move the check out of the $SKIP_VAR_SET block for that. For the current "hack/env.sh" this maybe not make an actual difference, because probably the code to assign default values will ensure that always valid value are set. Note that the openshift variant of the above code will detect the default via skopeo, which can fail. For that reason, this change makes more sense for openshift. However, also for the current code, performing the same error checking after filling out default values, ensures that the detected values are considered valid Even if that is in fact always the case, it's not entirely trivial to see.
[CVE-2024-45338](GHSA-w32m-9786-jp63) Signed-off-by: Andrea Panattoni <apanatto@redhat.com>
bump `golang.org/x/net` to `v0.33.0`
if we run on a system where the PF is not connected to the network we can still use it for tests but we need the link state to not be auto. Signed-off-by: Sebastian Sch <sebassch@gmail.com>
This will fix the issue we sometime see ` <string>: Dump was interrupted and may be inconsistent.\n` https://docs.kernel.org/userspace-api/netlink/intro.html#dump-consistency Signed-off-by: Sebastian Sch <sebassch@gmail.com>
functest: Fix ip link command output
add link state enable on test
It's enouph to configure ib_core module in /etc/moprobe.d/ for Ubuntu OS to change RDMA subsystem mode. Also this commit add OS check into kargs.sh error because 'grubby' isn't available in official Ubuntu repositories. Kernel param configuration support in Ubuntu should be implemented in a separate commit. Signed-off-by: Ivan Kolodiazhnyi <ikolodiazhny@nvidia.com>
Signed-off-by: Sebastian Sch <sebassch@gmail.com>
feat: Update controller logic to handle stale SriovNetworkNodeState CRs with delay
Skip kernel parameters configuration for Ubuntu
Bump k8s version ci
Signed-off-by: Ivan Kolodiazhnyi <ikolodiazhny@nvidia.com>
For using non-default MTU, OVS supports "mtu_request" field when adding a port to the bridge. eg: https://docs.openvswitch.org/en/latest/topics/dpdk/jumbo-frames/ Signed-off-by: Fred Rolland <frolland@nvidia.com>
with the introduction of rdma system mode change on baremetal systems it takes more than 1h that is the default for ginkgo Signed-off-by: Sebastian Sch <sebassch@gmail.com>
extend func-test timeout
[th/hack-env-check] hack/env.sh: move checking of environment variables outside SKIP_VAR_SET block
Support mtu_request for OVS
When creating a bridge with ovs-vsctl, an internal interface is added by default. The same behavior is added in this commit ovs-vsctl code ref: https://github.com/openvswitch/ovs/blob/main/utilities/ovs-vsctl.c#L1597 Signed-off-by: Fred Rolland <frolland@nvidia.com>
ovs: add internal interface
Do not configure BlueField NICs in DPU mode
Signed-off-by: Sebastian Sch <sebassch@gmail.com>
Rdma functional tests improvements
|
Thanks for your PR,
To skip the vendors CIs, Maintainers can use one of:
|
punkerpunker
approved these changes
Jan 31, 2025
punkerpunker
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🫡
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
No description provided.